Introduction to Medical Statistics 2024
Exercise 6 – Simple Linear Regression
Data Analysis and Model Diagnostics
In this exercise, we use the package ggplot2.
Exercise i) (Simple Linear Regression – HIV-negative CM patients) As in the exercises of day 1, we use the dataset cmTbmData.csv containing information on 201 patients with meningitis from 4 different patient groups. However, for this session, we will restrict attention to the 49 HIV-negative patients with cryptococcal meningitis. We will examine how well blood white cell count can predict CSF white cell count in this group.
- Import the dataset cmTbmData.csv and create a new data.frame cm.hivneg which contains HIV-negative patients with cryptococcal meningitis only.
- Perform a linear regression with CSF white cell count as the outcome (response variable) and blood white cell count as a covariable (explanatory variable) and interpret the output. Calculate the 95% confidence intervals for the regression coefficients. Use the functions lm, summary, and confint. What do you conclude from the model results?
Add the fitted regression line to a scatterplot of CSF white cell count against blood white cell count. (You can use the GUI in the ggplotgui package to obtain the scatterplot with the fitted regression line.) Looking at the plot, do you think that the model assumptions are fulfilled?
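The import, subsetting, model fit, and plot described above can be sketched as follows. This is only an illustration under stated assumptions: the column names `group`, `bldwcc`, and `csfwcc`, the group label `"CM HIV-"`, and the sizes of the three other groups are hypothetical, and the data are simulated so that the code runs without the course file. With the real data, replace the simulated file by `read.csv("cmTbmData.csv")` and adapt the names to the actual coding.

```r
# Simulated stand-in for cmTbmData.csv (201 rows, 4 groups).
# Column names, group labels, and the other group sizes are assumptions.
set.seed(42)
n <- 201
toy <- data.frame(
  group  = rep(c("CM HIV-", "CM HIV+", "TBM HIV-", "TBM HIV+"),
               times = c(49, 51, 50, 51)),
  bldwcc = rlnorm(n, meanlog = log(9), sdlog = 0.35)
)
toy$csfwcc <- 10^(1 + log10(toy$bldwcc) + rnorm(n, sd = 0.5))
csv_path <- file.path(tempdir(), "toy_cmTbmData.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Import the data and keep the HIV-negative CM patients only
cmTbmData <- read.csv(csv_path)
cm.hivneg <- subset(cmTbmData, group == "CM HIV-")

# Linear regression: CSF white cell count on blood white cell count
fit <- lm(csfwcc ~ bldwcc, data = cm.hivneg)
print(summary(fit))

# 95% confidence intervals for intercept and slope
ci <- confint(fit, level = 0.95)
print(ci)

# Scatterplot with the fitted least-squares line (skipped if ggplot2 is missing)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(cm.hivneg, aes(x = bldwcc, y = csfwcc)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = TRUE)
  print(p)
}
```

The shaded band drawn by `geom_smooth` is a pointwise confidence band for the regression line, not a prediction band for individual patients.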
- Produce diagnostic plots for the fitted model using plot(fit). Interpret the residuals. Do they indicate any problems regarding the assumptions of the linear regression model? (Further explanation of diagnostic plots in R is available at http://data.library.virginia.edu/diagnostic-plots/ and https://easystats.github.io/performance/.)
An alternative is to use the resid_panel function from the ggResidpanel package. Using the argument plots = "R" gives the same four plots.
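A minimal sketch of both routes to the diagnostic plots is given below. The data are simulated, and the column names `bldwcc` and `csfwcc` are assumptions; with the course data, `fit` would be the model fitted to cm.hivneg.

```r
# Simulated stand-in for the 49 HIV-negative CM patients (names are assumptions)
set.seed(42)
cm.hivneg <- data.frame(bldwcc = rlnorm(49, meanlog = log(9), sdlog = 0.35))
cm.hivneg$csfwcc <- 10^(1 + log10(cm.hivneg$bldwcc) + rnorm(49, sd = 0.5))
fit <- lm(csfwcc ~ bldwcc, data = cm.hivneg)

# Base R: the four default diagnostic plots on one page
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)  # restore the previous graphics settings

# ggplot2-based alternative (skipped if ggResidpanel is not installed)
if (requireNamespace("ggResidpanel", quietly = TRUE)) {
  print(ggResidpanel::resid_panel(fit, plots = "R"))
}

# The quantities behind the plots can also be inspected directly
std.res <- rstandard(fit)   # standardized residuals
lev     <- hatvalues(fit)   # leverage of each observation
print(summary(std.res))
```

Run non-interactively (for example with Rscript), the plots are written to a PDF file rather than shown on screen.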
- Create two new variables log10.bldwcc and log10.csfwcc containing the log10-transformed values of the original variables, then repeat the regression and the diagnostic plots (steps b) and c)) for the log-transformed variables. What do you conclude?
- The “Residuals vs Leverage” plot suggests that there is one individual who may have a large impact on the parameter estimates. Identify this observation and repeat the regression and the diagnostic plots (steps b) and c)) for the log-transformed data with that observation removed. Comment on the changes. Do you see any other individual who may not follow the model assumptions?
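One way to carry out the log transformation and to locate an influential observation numerically is sketched here. The data are simulated, the original column names `bldwcc` and `csfwcc` are assumptions, and picking the observation with the largest Cook's distance is just one possible heuristic. The exercise asks you to identify the point from the plot.

```r
# Simulated stand-in for the 49 HIV-negative CM patients (names are assumptions)
set.seed(42)
cm.hivneg <- data.frame(bldwcc = rlnorm(49, meanlog = log(9), sdlog = 0.35))
cm.hivneg$csfwcc <- 10^(1 + log10(cm.hivneg$bldwcc) + rnorm(49, sd = 0.5))

# log10-transformed variables and the corresponding model
cm.hivneg$log10.bldwcc <- log10(cm.hivneg$bldwcc)
cm.hivneg$log10.csfwcc <- log10(cm.hivneg$csfwcc)
fit.log <- lm(log10.csfwcc ~ log10.bldwcc, data = cm.hivneg)

# Influence measures shown in the "Residuals vs Leverage" plot
infl <- data.frame(
  row       = seq_len(nrow(cm.hivneg)),
  leverage  = hatvalues(fit.log),
  rstandard = rstandard(fit.log),
  cooks     = cooks.distance(fit.log)
)
worst <- which.max(infl$cooks)   # heuristic: largest Cook's distance
print(infl[worst, ])

# Refit without that observation and compare the coefficients
fit.log.red <- lm(log10.csfwcc ~ log10.bldwcc, data = cm.hivneg[-worst, ])
print(rbind(all = coef(fit.log), reduced = coef(fit.log.red)))
```

Comparing the two coefficient rows shows how much the estimates depend on the single removed observation.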
- What CSF white cell count does the model from e) (log-transformed data with the influential observation removed) predict for a patient with a blood white cell count of 10 × 10³/mm³, i.e., with log10.bldwcc = 1? Calculate a 95% prediction interval for log10.csfwcc in a patient with log10.bldwcc = 1.
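The prediction step can be sketched with predict and interval = "prediction". The data below are simulated (48 observations, mimicking the reduced data set), so the numbers printed are not the answer to the exercise. With the course data, `fit.log` would be the model from e).

```r
# Simulated stand-in for the log-transformed data with one observation removed
set.seed(42)
dat <- data.frame(log10.bldwcc = log10(rlnorm(48, meanlog = log(9), sdlog = 0.35)))
dat$log10.csfwcc <- 1 + dat$log10.bldwcc + rnorm(48, sd = 0.5)
fit.log <- lm(log10.csfwcc ~ log10.bldwcc, data = dat)

# New patient with a blood white cell count of 10 x 10^3/mm^3
new.patient <- data.frame(log10.bldwcc = 1)

# Point prediction and 95% prediction interval on the log10 scale
pred <- predict(fit.log, newdata = new.patient,
                interval = "prediction", level = 0.95)
print(pred)      # columns: fit, lwr, upr

# Back-transformed to the original CSF white cell count scale
print(10^pred)
```

A prediction interval is wider than the confidence interval for the mean response at the same covariate value because it also accounts for the residual variation of a single patient. If the errors are normal on the log10 scale, the back-transformed point prediction corresponds to a predicted median rather than a mean.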